- Learn the main dplyr verbs
- Understand how pipes work
June 19, 2018
Practical is on RStudio cloud : https://rstudio.cloud/project/42122
It needs an account, but these are free!
head(gapminder, n = 2)
## # A tibble: 2 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821.
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables: ## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ... ## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ... ## $ lifeExp : num 28.8 30.3 32 34 36.1 ... ## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ... ## $ gdpPercap: num 779 821 853 836 740 ...
dim(gapminder)
## [1] 1704 6
# single filter filter(gapminder, year == 1997)
Strings are characters/words/letters
Strings need to be in quotes
filter(gapminder,
country == "United Kingdom")
## # A tibble: 12 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 United Kingdom Europe 1952 69.2 50430000 9980. ## 2 United Kingdom Europe 1957 70.4 51430000 11283. ## 3 United Kingdom Europe 1962 70.8 53292000 12477. ## 4 United Kingdom Europe 1967 71.4 54959000 14143. ## 5 United Kingdom Europe 1972 72.0 56079000 15895. ## 6 United Kingdom Europe 1977 72.8 56179000 17429. ## 7 United Kingdom Europe 1982 74.0 56339704 18232. ## 8 United Kingdom Europe 1987 75.0 56981620 21665. ## 9 United Kingdom Europe 1992 76.4 57866349 22705. ## 10 United Kingdom Europe 1997 77.2 58808266 26075. ## 11 United Kingdom Europe 2002 78.5 59912431 29479. ## 12 United Kingdom Europe 2007 79.4 60776238 33203.
| Relational Operations | Outputs TRUE or FALSE based on the validity of the statement. |
|---|---|
| a == b | a is equal to b |
| a != b | a is not equal to b |
| a > b | a is greater than b |
| a < b | a is less than b |
| a >= b | a is greater than or equal to b |
| a <= b | a is less than or equal to b |
| a %in% b | a is an element in b |
Table from http://stat545.com/cm005-notes_and_exercises.html
%in% - value match - Does this exist in that
Filter year to only keep those that are either 1977 or 1997
filter(gapminder,
year %in% c(1977, 1997))
## # A tibble: 284 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1977 38.4 14880372 786. ## 2 Afghanistan Asia 1997 41.8 22227415 635. ## 3 Albania Europe 1977 68.9 2509048 3533. ## 4 Albania Europe 1997 73.0 3428038 3193. ## 5 Algeria Africa 1977 58.0 17152804 4910. ## 6 Algeria Africa 1997 69.2 29072015 4797. ## 7 Angola Africa 1977 39.5 6162675 3009. ## 8 Angola Africa 1997 41.0 9875024 2277. ## 9 Argentina Americas 1977 68.5 26983828 10079. ## 10 Argentina Americas 1997 73.3 36203463 10967. ## # ... with 274 more rows
Multiple filters may be separated by a , (comma)
dplyr interprets the , (comma) to be & (and)
filter(gapminder,
year == 1997, lifeExp > 78)
## # A tibble: 13 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Australia Oceania 1997 78.8 18565243 26998. ## 2 Canada Americas 1997 78.6 30305843 28955. ## 3 France Europe 1997 78.6 58623428 25890. ## 4 Hong Kong, China Asia 1997 80 6495918 28378. ## 5 Iceland Europe 1997 79.0 271192 28061. ## 6 Israel Asia 1997 78.3 5531387 20897. ## 7 Italy Europe 1997 78.8 57479469 24675. ## 8 Japan Asia 1997 80.7 125956499 28817. ## 9 Netherlands Europe 1997 78.0 15604464 30246. ## 10 Norway Europe 1997 78.3 4405672 41283. ## 11 Spain Europe 1997 78.8 39855442 20445. ## 12 Sweden Europe 1997 79.4 8897619 25267. ## 13 Switzerland Europe 1997 79.4 7193761 32135.
| Logical Operations | Outputs TRUE or FALSE based on the validity of the statement. |
|---|---|
| a & b, a && b | Both a and b are TRUE |
| a | b, a |
| !a | a is not TRUE (take everything else but a) |
| xor(a, b) | Either a or b is TRUE, but not both. |
| all(a,b,c,…) | a, b, c, . are all TRUE. |
| any(a,b,c,…) | Any one of a, b, c, . is TRUE. |
Table from http://stat545.com/cm005-notes_and_exercises.html
Keep a row if either criteria is met
filter(gapminder,
pop > 350000000 | lifeExp > 80)
## # A tibble: 45 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Australia Oceania 2002 80.4 19546792 30688. ## 2 Australia Oceania 2007 81.2 20434176 34435. ## 3 Canada Americas 2007 80.7 33390141 36319. ## 4 China Asia 1952 44 556263527 400. ## 5 China Asia 1957 50.5 637408000 576. ## 6 China Asia 1962 44.5 665770000 488. ## 7 China Asia 1967 58.4 754550000 613. ## 8 China Asia 1972 63.1 862030000 677. ## 9 China Asia 1977 64.0 943455000 741. ## 10 China Asia 1982 65.5 1000281000 962. ## # ... with 35 more rows
between() can be used to find observations between a range.
Instead of writing this:
filter(gapminder,
year >= 1977 | year <= 1997)
We can do this
filter(gapminder,
between(year, 1977, 1997) )
## # A tibble: 710 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1977 38.4 14880372 786. ## 2 Afghanistan Asia 1982 39.9 12881816 978. ## 3 Afghanistan Asia 1987 40.8 13867957 852. ## 4 Afghanistan Asia 1992 41.7 16317921 649. ## 5 Afghanistan Asia 1997 41.8 22227415 635. ## 6 Albania Europe 1977 68.9 2509048 3533. ## 7 Albania Europe 1982 70.4 2780097 3631. ## 8 Albania Europe 1987 72 3075321 3739. ## 9 Albania Europe 1992 71.6 3326498 2497. ## 10 Albania Europe 1997 73.0 3428038 3193. ## # ... with 700 more rows
R will run the & before the |
Keep only rows where country is Canada or France and the year is 1982
filter(gapminder,
country == 'Canada' | country == 'France' & year == 1982 )
## # A tibble: 13 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Canada Americas 1952 68.8 14785584 11367. ## 2 Canada Americas 1957 70.0 17010154 12490. ## 3 Canada Americas 1962 71.3 18985849 13462. ## 4 Canada Americas 1967 72.1 20819767 16077. ## 5 Canada Americas 1972 72.9 22284500 18971. ## 6 Canada Americas 1977 74.2 23796400 22091. ## 7 Canada Americas 1982 75.8 25201900 22899. ## 8 Canada Americas 1987 76.9 26549700 26627. ## 9 Canada Americas 1992 78.0 28523502 26343. ## 10 Canada Americas 1997 78.6 30305843 28955. ## 11 Canada Americas 2002 79.8 31902268 33329. ## 12 Canada Americas 2007 80.7 33390141 36319. ## 13 France Europe 1982 74.9 54433565 20294.
R will run the & before the |
Keep only rows where country is Canada or France and the year is 1982
filter(gapminder,
(country == 'Canada' | country == 'France') & year == 1982)
## # A tibble: 2 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Canada Americas 1982 75.8 25201900 22899. ## 2 France Europe 1982 74.9 54433565 20294.
Use !is.na() to filter out missing data
head(msleep, n = 2)
## # A tibble: 2 x 11 ## name genus vore order conservation sleep_total sleep_rem sleep_cycle ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 Cheet~ Acino~ carni Carn~ lc 12.1 NA NA ## 2 Owl m~ Aotus omni Prim~ <NA> 17 1.8 NA ## # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
filter(msleep,
!is.na(conservation))
## # A tibble: 54 x 11 ## name genus vore order conservation sleep_total sleep_rem sleep_cycle ## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 Cheet~ Acin~ carni Carn~ lc 12.1 NA NA ## 2 Mount~ Aplo~ herbi Rode~ nt 14.4 2.4 NA ## 3 Great~ Blar~ omni Sori~ lc 14.9 2.3 0.133 ## 4 Cow Bos herbi Arti~ domesticated 4 0.7 0.667 ## 5 North~ Call~ carni Carn~ vu 8.7 1.4 0.383 ## 6 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333 ## 7 Roe d~ Capr~ herbi Arti~ lc 3 NA NA ## 8 Goat Capri herbi Arti~ lc 5.3 0.6 NA ## 9 Guine~ Cavis herbi Rode~ domesticated 9.4 0.8 0.217 ## 10 Grivet Cerc~ omni Prim~ lc 10 0.7 NA ## # ... with 44 more rows, and 3 more variables: awake <dbl>, brainwt <dbl>, ## # bodywt <dbl>
# single filter filter(gapminder, year == 1997) filter(gapminder, continent == "Asia") # multiple filters filter(gapminder, year == 1997 & gdpPercap > 1000) filter(gapminder, year %in% c(1977, 1997))
Use select to keep or remove certain columns from a data sets.
select(gapminder, country, year, lifeExp)
Select all columns from country to life expectancy (lifeExp)
select(gapminder, country:lifeExp)
## # A tibble: 1,704 x 4 ## country continent year lifeExp ## <fct> <fct> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 ## 2 Afghanistan Asia 1957 30.3 ## 3 Afghanistan Asia 1962 32.0 ## 4 Afghanistan Asia 1967 34.0 ## 5 Afghanistan Asia 1972 36.1 ## 6 Afghanistan Asia 1977 38.4 ## 7 Afghanistan Asia 1982 39.9 ## 8 Afghanistan Asia 1987 40.8 ## 9 Afghanistan Asia 1992 41.7 ## 10 Afghanistan Asia 1997 41.8 ## # ... with 1,694 more rows
Select all columns BUT country
Using - to not select a specific column
Can also be vectorized or used with helper functions
select(gapminder, -country)
## # A tibble: 1,704 x 5 ## continent year lifeExp pop gdpPercap ## <fct> <int> <dbl> <int> <dbl> ## 1 Asia 1952 28.8 8425333 779. ## 2 Asia 1957 30.3 9240934 821. ## 3 Asia 1962 32.0 10267083 853. ## 4 Asia 1967 34.0 11537966 836. ## 5 Asia 1972 36.1 13079460 740. ## 6 Asia 1977 38.4 14880372 786. ## 7 Asia 1982 39.9 12881816 978. ## 8 Asia 1987 40.8 13867957 852. ## 9 Asia 1992 41.7 16317921 649. ## 10 Asia 1997 41.8 22227415 635. ## # ... with 1,694 more rows
Select all columns that start with sleep
select(msleep, starts_with("sleep"))
## # A tibble: 83 x 3 ## sleep_total sleep_rem sleep_cycle ## <dbl> <dbl> <dbl> ## 1 12.1 NA NA ## 2 17 1.8 NA ## 3 14.4 2.4 NA ## 4 14.9 2.3 0.133 ## 5 4 0.7 0.667 ## 6 14.4 2.2 0.767 ## 7 8.7 1.4 0.383 ## 8 7 NA NA ## 9 10.1 2.9 0.333 ## 10 3 NA NA ## # ... with 73 more rows
Allows you to create a character vector of column names and use it to select columns from dataframe
to_keep <- c('country', 'lifeExp', 'gdpPercap')
select(gapminder, one_of(to_keep))
## # A tibble: 1,704 x 3 ## country lifeExp gdpPercap ## <fct> <dbl> <dbl> ## 1 Afghanistan 28.8 779. ## 2 Afghanistan 30.3 821. ## 3 Afghanistan 32.0 853. ## 4 Afghanistan 34.0 836. ## 5 Afghanistan 36.1 740. ## 6 Afghanistan 38.4 786. ## 7 Afghanistan 39.9 978. ## 8 Afghanistan 40.8 852. ## 9 Afghanistan 41.7 649. ## 10 Afghanistan 41.8 635. ## # ... with 1,694 more rows
select(gapminder, Life_Expectancy = lifeExp)
## # A tibble: 1,704 x 1 ## Life_Expectancy ## <dbl> ## 1 28.8 ## 2 30.3 ## 3 32.0 ## 4 34.0 ## 5 36.1 ## 6 38.4 ## 7 39.9 ## 8 40.8 ## 9 41.7 ## 10 41.8 ## # ... with 1,694 more rows
rename(gapminder, Life_Expectancy = lifeExp)
## # A tibble: 1,704 x 6 ## country continent year Life_Expectancy pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786. ## 7 Afghanistan Asia 1982 39.9 12881816 978. ## 8 Afghanistan Asia 1987 40.8 13867957 852. ## 9 Afghanistan Asia 1992 41.7 16317921 649. ## 10 Afghanistan Asia 1997 41.8 22227415 635. ## # ... with 1,694 more rows
select(gapminder, country, year, lifeExp)
# vectorized
select(gapminder, country:year)
# select all columns from country to year
# Not select a column
select(gapminder, -country)
# select all columns BUT country
select(gapminder, contains("co"))
# multiple steps with intermediate dataframes created
new_df <- filter(gapminder, year == 1997)
select(new_df, -year)
# nested commands
select(
filter(gapminder,
year == 1997),
-year)
# using a pipe %>%
gapminder %>%
filter(year == 1997) %>%
select(-year)
The pipe %>% takes the output from the previous command as the input the next command
Input data information for each command is now missing
The input is the output of the function above!
subsequent "piped" commands are indented - makes code easier to read
gapminder %>% filter(year == 1997) %>% select(-year)
## # A tibble: 142 x 5 ## country continent lifeExp pop gdpPercap ## <fct> <fct> <dbl> <int> <dbl> ## 1 Afghanistan Asia 41.8 22227415 635. ## 2 Albania Europe 73.0 3428038 3193. ## 3 Algeria Africa 69.2 29072015 4797. ## 4 Angola Africa 41.0 9875024 2277. ## 5 Argentina Americas 73.3 36203463 10967. ## 6 Australia Oceania 78.8 18565243 26998. ## 7 Austria Europe 77.5 8069876 29096. ## 8 Bahrain Asia 73.9 598561 20292. ## 9 Bangladesh Asia 59.4 123315288 973. ## 10 Belgium Europe 77.5 10199787 27561. ## # ... with 132 more rows
gapminder %>% filter(year == 1997) %>% select(-year)
gdp_billion is the name of the newly created variable
= gdpPercap*pop/10^9 is the equation to make the new variable
mutate(gapminder, gdp_billion = gdpPercap*pop/10^9)
There is no column name set for the new variable - defaults to the equation
= gdpPercap*pop/10^9 is the equation to make the new variable
mutate(gapminder, gdpPercap*pop/10^9)
## # A tibble: 1,704 x 7 ## country continent year lifeExp pop gdpPercap `gdpPercap * po~ ## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. 6.57 ## 2 Afghanistan Asia 1957 30.3 9240934 821. 7.59 ## 3 Afghanistan Asia 1962 32.0 10267083 853. 8.76 ## 4 Afghanistan Asia 1967 34.0 11537966 836. 9.65 ## 5 Afghanistan Asia 1972 36.1 13079460 740. 9.68 ## 6 Afghanistan Asia 1977 38.4 14880372 786. 11.7 ## 7 Afghanistan Asia 1982 39.9 12881816 978. 12.6 ## 8 Afghanistan Asia 1987 40.8 13867957 852. 11.8 ## 9 Afghanistan Asia 1992 41.7 16317921 649. 10.6 ## 10 Afghanistan Asia 1997 41.8 22227415 635. 14.1 ## # ... with 1,694 more rows
Make sure you set column names when using mutate()
Each new column is separated by a , (comma)
mutate(gapminder,
gdp_billion = gdpPercap*pop/10^9,
life_years = lifeExp * pop,
life_gdp = life_years / gdp_billion)
## # A tibble: 1,704 x 9 ## country continent year lifeExp pop gdpPercap gdp_billion life_years ## <fct> <fct> <int> <dbl> <int> <dbl> <dbl> <dbl> ## 1 Afghan~ Asia 1952 28.8 8.43e6 779. 6.57 242658016. ## 2 Afghan~ Asia 1957 30.3 9.24e6 821. 7.59 280296010. ## 3 Afghan~ Asia 1962 32.0 1.03e7 853. 8.76 328515855. ## 4 Afghan~ Asia 1967 34.0 1.15e7 836. 9.65 392521603. ## 5 Afghan~ Asia 1972 36.1 1.31e7 740. 9.68 472011552. ## 6 Afghan~ Asia 1977 38.4 1.49e7 786. 11.7 571971739. ## 7 Afghan~ Asia 1982 39.9 1.29e7 978. 12.6 513391895. ## 8 Afghan~ Asia 1987 40.8 1.39e7 852. 11.8 566117741. ## 9 Afghan~ Asia 1992 41.7 1.63e7 649. 10.6 680033040. ## 10 Afghan~ Asia 1997 41.8 2.22e7 635. 14.1 928283533. ## # ... with 1,694 more rows, and 1 more variable: life_gdp <dbl>
Can use newly created variables IMMEDIATELY!!
# single variables
mutate(gapminder, gdp_billion = gdpPercap*pop/10^9)
# multiple variables
mutate(gapminder,
gdp_billion = gdpPercap*pop/10^9,
life_years = lifeExp * pop,
life_gdp = life_years / gdp_billion)
Practical will build on previous verbs and the pipe!
Single parameter ascending arrange
# ascending sort arrange(gapminder, lifeExp) # descending sort arrange(gapminder, desc(lifeExp))
arrange(gapminder, desc(lifeExp), pop)
## # A tibble: 1,704 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Japan Asia 2007 82.6 127467972 31656. ## 2 Hong Kong, China Asia 2007 82.2 6980412 39725. ## 3 Japan Asia 2002 82 127065841 28605. ## 4 Iceland Europe 2007 81.8 301931 36181. ## 5 Switzerland Europe 2007 81.7 7554661 37506. ## 6 Hong Kong, China Asia 2002 81.5 6762476 30209. ## 7 Australia Oceania 2007 81.2 20434176 34435. ## 8 Spain Europe 2007 80.9 40448191 28821. ## 9 Sweden Europe 2007 80.9 9031088 33860. ## 10 Israel Asia 2007 80.7 6426679 25523. ## # ... with 1,694 more rows
# single parameter ascending arrange arrange(gapminder, lifeExp) # single parameter descending arrange arrange(gapminder, desc(lifeExp)) # multiple parameter arrange arrange(gapminder, desc(lifeExp), pop)
Practical will build on previous verbs and the pipe!
Group together rows of data by some factor within the dataset.
Alone, it is hard to see what this command is doing
BUT it is INCREDIBLY useful and VERY powerful
group_by(gapminder, continent)
## # A tibble: 1,704 x 6 ## # Groups: continent [5] ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786. ## 7 Afghanistan Asia 1982 39.9 12881816 978. ## 8 Afghanistan Asia 1987 40.8 13867957 852. ## 9 Afghanistan Asia 1992 41.7 16317921 649. ## 10 Afghanistan Asia 1997 41.8 22227415 635. ## # ... with 1,694 more rows
Once a dataset is grouped, dplyr performs all operations on that group
Its like each group has become its own dataset
Image taken from Software Carpentry Foundation (https://software-carpentry.org/)
Each column that you want to group by is listed separated by a , (comma)
Order is important!
Below will first group by country and then continent
group_by(gapminder, country, continent)
## # A tibble: 1,704 x 6 ## # Groups: country, continent [142] ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786. ## 7 Afghanistan Asia 1982 39.9 12881816 978. ## 8 Afghanistan Asia 1987 40.8 13867957 852. ## 9 Afghanistan Asia 1992 41.7 16317921 649. ## 10 Afghanistan Asia 1997 41.8 22227415 635. ## # ... with 1,694 more rows
summarize() also works
# get mean of life expectancy summarise(gapminder, mean_lifexp = mean(lifeExp))
## # A tibble: 1 x 1 ## mean_lifexp ## <dbl> ## 1 59.5
summarise() returns a dataframe
summarize() also works
Each summary statistic is separated by a , (comma)
# get mean of life expectancy and mean of GDP
summarise(gapminder,
mean_lifexp = mean(lifeExp),
mean_gdp = mean(gdpPercap))
## # A tibble: 1 x 2 ## mean_lifexp mean_gdp ## <dbl> <dbl> ## 1 59.5 7215.
These are not the only functions that work with summarise()
| Function | Description |
|---|---|
| min() | minimum |
| max() | maximum |
| sum() | sum |
| range() | range |
| mean() | mean |
| median() | median |
| sd() | standard deviation |
| var() | variance |
See http://www.statmethods.net/management/functions.html for more information/options
n() counts number of observations
n_distinct() counts the number of distinct observation for that column
gapminder %>%
summarise(n = n(),
n_distinct = n_distinct(country))
## # A tibble: 1 x 2 ## n n_distinct ## <int> <int> ## 1 1704 142
Using group_by() with summarise() returns the summary statistic for each member of the group
gapminder %>%
group_by(continent) %>%
summarise(mean_lifeExp = mean(lifeExp),
median_gdp = median(gdpPercap))
## # A tibble: 5 x 3 ## continent mean_lifeExp median_gdp ## <fct> <dbl> <dbl> ## 1 Africa 48.9 1192. ## 2 Americas 64.7 5466. ## 3 Asia 60.1 2647. ## 4 Europe 71.9 12082. ## 5 Oceania 74.3 17983.
Image taken from Software Carpentry Foundation (https://software-carpentry.org/)
n() counts number of observations
n_distinct() counts the number of distinct observation for that column
gapminder %>%
group_by(continent) %>%
summarise(n = n(),
n_distinct = n_distinct(country))
## # A tibble: 5 x 3 ## continent n n_distinct ## <fct> <int> <int> ## 1 Africa 624 52 ## 2 Americas 300 25 ## 3 Asia 396 33 ## 4 Europe 360 30 ## 5 Oceania 24 2
get number of observations for each group - while keeping the entire original dataframe
Use to subset for duplicates, minimum/maximum occurrences, etc.
gapminder %>% group_by(country) %>% add_tally()
## # A tibble: 1,704 x 7 ## # Groups: country [142] ## country continent year lifeExp pop gdpPercap n ## <fct> <fct> <int> <dbl> <int> <dbl> <int> ## 1 Afghanistan Asia 1952 28.8 8425333 779. 12 ## 2 Afghanistan Asia 1957 30.3 9240934 821. 12 ## 3 Afghanistan Asia 1962 32.0 10267083 853. 12 ## 4 Afghanistan Asia 1967 34.0 11537966 836. 12 ## 5 Afghanistan Asia 1972 36.1 13079460 740. 12 ## 6 Afghanistan Asia 1977 38.4 14880372 786. 12 ## 7 Afghanistan Asia 1982 39.9 12881816 978. 12 ## 8 Afghanistan Asia 1987 40.8 13867957 852. 12 ## 9 Afghanistan Asia 1992 41.7 16317921 649. 12 ## 10 Afghanistan Asia 1997 41.8 22227415 635. 12 ## # ... with 1,694 more rows
group_by() comes in handy for more than just summarise()
Combine group_by() and filter() to filter the data by groups
gapminder %>% group_by(continent) %>% filter(lifeExp == max(lifeExp) | pop == min(pop))
## # A tibble: 10 x 6 ## # Groups: continent [5] ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Australia Oceania 2007 81.2 20434176 34435. ## 2 Bahrain Asia 1952 50.9 120447 9867. ## 3 Canada Americas 2007 80.7 33390141 36319. ## 4 Iceland Europe 1952 72.5 147962 7268. ## 5 Iceland Europe 2007 81.8 301931 36181. ## 6 Japan Asia 2007 82.6 127467972 31656. ## 7 New Zealand Oceania 1952 69.4 1994794 10557. ## 8 Reunion Africa 2007 76.4 798094 7670. ## 9 Sao Tome and Principe Africa 1952 46.5 60011 880. ## 10 Trinidad and Tobago Americas 1952 59.1 662850 3023.
Say I want the first and last occurrence of each group sprted by some element
row_number() returns the row that you specify
n() refers to the total number of observations - can be different for each group!
gapminder %>% group_by(country) %>% arrange(year) %>% filter(row_number() == 1 | row_number() == n() ) %>% arrange(country)
## # A tibble: 284 x 6 ## # Groups: country [142] ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 2007 43.8 31889923 975. ## 3 Albania Europe 1952 55.2 1282697 1601. ## 4 Albania Europe 2007 76.4 3600523 5937. ## 5 Algeria Africa 1952 43.1 9279525 2449. ## 6 Algeria Africa 2007 72.3 33333216 6223. ## 7 Angola Africa 1952 30.0 4232095 3521. ## 8 Angola Africa 2007 42.7 12420476 4797. ## 9 Argentina Americas 1952 62.5 17876956 5911. ## 10 Argentina Americas 2007 75.3 40301927 12779. ## # ... with 274 more rows
Grouped dataframes remained grouped
Explicitly need to ungroup() the data
gap <- gapminder %>% group_by(continent) %>% filter(lifeExp == max(lifeExp) | pop == min(pop)) summarise(gap, mean_gdpPercap = mean(gdpPercap))
## # A tibble: 5 x 2 ## continent mean_gdpPercap ## <fct> <dbl> ## 1 Africa 4275. ## 2 Americas 19671. ## 3 Asia 20762. ## 4 Europe 21724. ## 5 Oceania 22496.
Grouped dataframes remained grouped
Explicitly need to ungroup() the data
gap <- gapminder %>% group_by(continent) %>% filter(lifeExp == max(lifeExp) | pop == min(pop)) %>% ungroup() summarise(gap, mean_gdpPercap = mean(gdpPercap))
## # A tibble: 1 x 1 ## mean_gdpPercap ## <dbl> ## 1 17786.
gapminder %>%
group_by(continent) %>%
summarise(mean_lifeExp = mean(lifeExp),
median_gdp = median(gdpPercap))
Practical will build on previous verbs and the pipe!
Syntax for dplyr is more user friendly than base R
Start asking complex questions within a few lines of code
https://software-carpentry.org/
https://4va.github.io/biodatasci/index.html